ner dataset
Human-Annotated NER Dataset for the Kyrgyz Language
Turatali, Timur, Alekseev, Anton, Jumalieva, Gulira, Kabaeva, Gulnara, Nikolenko, Sergey
We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We show our annotation scheme, discuss the challenges encountered in the annotation process, and present the descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for Kyrgyz language processing pipelines evaluation.
- South America > Argentina (0.14)
- Asia > Kyrgyzstan > Chüy Region > Bishkek (0.05)
- Asia > Russia (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)
ESNERA: Empirical and semantic named entity alignment for named entity dataset merging
Zhang, Xiaobo, He, Congqing, He, Ying, Peng, Jian, Fu, Dajie, Tan, Tien-Ping
Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (11 more...)
- Health & Medicine (1.00)
- Information Technology (0.67)
Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI
Lee, Junhyeong, Yuk, Jong Min, Lee, Chan-Woo
The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent adva nces in natural language pro cessing (NLP) have facilitated automatic extraction of structured data from uns tructured scientific literature. While existing approaches--multi-step and direct methods--offer va luable capabilities, they also come with limitations when applied independently. He re, we propose a novel hybrid text-mining framework that integrates the advantages of both methods to convert unstructured scientific text into structured data. Our approach first tran sforms raw text into entity-recognized text, and subsequently into structured form. Furthermore, beyond the overall data structuring framework, we also enhance entity recogniti on performance by introducing an entity marker--a simple yet effective technique that uses sym bolic annotations to highlight target entities. Specifically, our entity marker-based hybrid approach not onl y consistently outperforms previous entity recognition approaches across three benchmark datasets (MatScholar, SOFC, and SOFC slot NER) but also improve the quality of final st ructured data--yielding up to a 58% improvement in entity-level F1 score and up to 83% improveme nt in relation-level F1 score compared to direct approach.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > South Korea > Daejeon > Daejeon (0.04)
- North America > Dominican Republic (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Workflow (0.69)
- Research Report (0.64)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.97)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.52)
EDU-NER-2025: Named Entity Recognition in Urdu Educational Texts using XLM-RoBERTa with X (formerly Twitter)
Ullah, Fida, Ahmad, Muhammad, Zamir, Muhammad Tayyab, Arif, Muhammad, sidorov, Grigori, Riverón, Edgardo Manuel Felipe, Gelbukh, Alexander
Named Entity Recognition (NER) plays a pivotal role in various Natural Language Processing (NLP) tasks by identifying and classifying named entities (NEs) from unstructured data into predefined categories such as person, organization, location, date, and time. While extensive research exists for high-resource languages and general domains, NER in Urdu particularly within domain-specific contexts like education remains significantly underexplored. This is Due to lack of annotated datasets for educational content which limits the ability of existing models to accurately identify entities such as academic roles, course names, and institutional terms, underscoring the urgent need for targeted resources in this domain. To the best of our knowledge, no dataset exists in the domain of the Urdu language for this purpose. To achieve this objective this study makes three key contributions. Firstly, we created a manually annotated dataset in the education domain, named EDU-NER-2025, which contains 13 unique most important entities related to education domain. Second, we describe our annotation process and guidelines in detail and discuss the challenges of labelling EDU-NER-2025 dataset. Third, we addressed and analyzed key linguistic challenges, such as morphological complexity and ambiguity, which are prevalent in formal Urdu texts.
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Europe > Switzerland (0.04)
EIoU-EMC: A Novel Loss for Domain-specific Nested Entity Recognition
Zhang, Jian, Zhang, Tianqing, Li, Qi, Wang, Hongwei
In recent years, research has mainly focused on the general NER task. There still have some challenges with nested NER task in the specific domains. Specifically, the scenarios of low resource and class imbalance impede the wide application for biomedical and industrial domains. In this study, we design a novel loss EIoU-EMC, by enhancing the implement of Intersection over Union loss and Multiclass loss. Our proposed method specially leverages the information of entity boundary and entity classification, thereby enhancing the model's capacity to learn from a limited number of data samples. To validate the performance of this innovative method in enhancing NER task, we conducted experiments on three distinct biomedical NER datasets and one dataset constructed by ourselves from industrial complex equipment maintenance documents. Comparing to strong baselines, our method demonstrates the competitive performance across all datasets. During the experimental analysis, our proposed method exhibits significant advancements in entity boundary recognition and entity classification. Our code are available here.
- Europe > Italy (0.05)
- Asia > China > Zhejiang Province > Hangzhou (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition
Yazdani, Anthony, Stepanov, Ihor, Teodoro, Douglas
Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or efficiently adapt to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (3 more...)
NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER
Atuhurra, Jesse, Kamigaito, Hidetaka, Ouchi, Hiroki, Shindo, Hiroyuki, Watanabe, Taro
Adapting named entity recognition (NER) methods to new domains poses significant challenges. We introduce RapidNER, a framework designed for the rapid deployment of NER systems through efficient dataset construction. RapidNER operates through three key steps: (1) extracting domain-specific sub-graphs and triples from a general knowledge graph, (2) collecting and leveraging texts from various sources to build the NERsocial dataset, which focuses on entities typical in human-robot interaction, and (3) implementing an annotation scheme using Elasticsearch (ES) to enhance efficiency. NERsocial, validated by human annotators, includes six entity types, 153K tokens, and 99.4K sentences, demonstrating RapidNER's capability to expedite dataset creation.
- North America > United States > Virginia (0.04)
- Asia > India (0.04)
- South America > Peru (0.04)
- (33 more...)
- Media > News (1.00)
- Media > Music (1.00)
- Leisure & Entertainment > Sports > Motorsports (1.00)
- (14 more...)
MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities
Durmaz, Ali Riza, Thomas, Akhil, Mishra, Lokesh, Murthy, Rachana Niranjan, Straub, Thomas
While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
- Europe > Germany > Baden-Württemberg > Freiburg (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (3 more...)
- Information Technology (0.68)
- Materials > Metals & Mining (0.46)
- Energy (0.46)
- Media > Music (0.34)
Annotation Errors and NER: A Study with OntoNotes 5.0
Bernier-Colborne, Gabriel, Vajjala, Sowmya
Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.
- Asia > China > Hong Kong (0.04)
- Oceania > Australia (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (12 more...)
Curating Grounded Synthetic Data with Global Perspectives for Equitable AI
Törnquist, Elin, Caulk, Robert Alexander
The development of robust AI models relies heavily on the quality and variety of training data available. In fields where data scarcity is prevalent, synthetic data generation offers a vital solution. In this paper, we introduce a novel approach to creating synthetic datasets, grounded in real-world diversity and enriched through strategic diversification. We synthesize data using a comprehensive collection of news articles spanning 12 languages and originating from 125 countries, to ensure a breadth of linguistic and cultural representations. Through enforced topic diversification, translation, and summarization, the resulting dataset accurately mirrors real-world complexities and addresses the issue of underrepresentation in traditional datasets. This methodology, applied initially to Named Entity Recognition (NER), serves as a model for numerous AI disciplines where data diversification is critical for generalizability. Preliminary results demonstrate substantial improvements in performance on traditional NER benchmarks, by up to 7.3%, highlighting the effectiveness of our synthetic data in mimicking the rich, varied nuances of global data sources. This paper outlines the strategies employed for synthesizing diverse datasets and provides such a curated dataset for NER.
- Europe > France (0.28)
- South America > Venezuela (0.04)
- South America > Uruguay (0.04)
- (122 more...)
- Government (1.00)
- Leisure & Entertainment > Sports > Football (0.68)
- Energy (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)